{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install python-dotenv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 文档转换器\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 文档切割\n",
    "\n",
    "### 原理\n",
    "\n",
    "1. 将文档分成小的、有意义的块(句子).\n",
    "2. 将小的块组合成为一个更大的块，直到达到一定的大小.\n",
    "3. 一旦达到一定的大小，接着开始创建与下一个块重叠的部分.\n",
    "\n",
    "### 示例\n",
    "\n",
    "- 第一个文档分割\n",
    "- 按字符切割\n",
    "- 代码文档切割\n",
    "- 按token来切割\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 第一个文档切割"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='蒂法·洛克哈特(日语:ティファ・ロックハート，Tifa Rokkuhāto，英语:Tifa', metadata={'start_index': 6})"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "#加载要切割的文档\n",
    "with open(\"test.txt\") as f:\n",
    "    zuizhonghuanxiang = f.read()\n",
    "\n",
    "#初始化切割器\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=50,#切分的文本块大小，一般通过长度函数计算\n",
    "    chunk_overlap=20,#切分的文本块重叠大小，一般通过长度函数计算\n",
    "    length_function=len,#长度函数,也可以传递tokenize函数\n",
    "    add_start_index=True,#是否添加起始索引\n",
    ")\n",
    "\n",
    "text = text_splitter.create_documents([zuizhonghuanxiang])\n",
    "text[0]\n",
    "text[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 字符串切割"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Created a chunk of size 125, which is longer than the specified 50\n",
      "Created a chunk of size 72, which is longer than the specified 50\n",
      "Created a chunk of size 72, which is longer than the specified 50\n",
      "Created a chunk of size 63, which is longer than the specified 50\n",
      "Created a chunk of size 52, which is longer than the specified 50\n",
      "Created a chunk of size 96, which is longer than the specified 50\n",
      "Created a chunk of size 51, which is longer than the specified 50\n",
      "Created a chunk of size 66, which is longer than the specified 50\n",
      "Created a chunk of size 105, which is longer than the specified 50\n",
      "Created a chunk of size 84, which is longer than the specified 50\n",
      "Created a chunk of size 78, which is longer than the specified 50\n",
      "Created a chunk of size 72, which is longer than the specified 50\n",
      "Created a chunk of size 66, which is longer than the specified 50\n",
      "Created a chunk of size 92, which is longer than the specified 50\n",
      "Created a chunk of size 58, which is longer than the specified 50\n",
      "Created a chunk of size 67, which is longer than the specified 50\n",
      "Created a chunk of size 73, which is longer than the specified 50\n",
      "Created a chunk of size 52, which is longer than the specified 50\n",
      "Created a chunk of size 61, which is longer than the specified 50\n",
      "Created a chunk of size 77, which is longer than the specified 50\n",
      "Created a chunk of size 82, which is longer than the specified 50\n",
      "Created a chunk of size 61, which is longer than the specified 50\n",
      "Created a chunk of size 52, which is longer than the specified 50\n",
      "Created a chunk of size 60, which is longer than the specified 50\n",
      "Created a chunk of size 51, which is longer than the specified 50\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='蒂法介绍\\n蒂法·洛克哈特(日语:ティファ・ロックハート，Tifa Rokkuhāto，英语:Tifa Lockhart)为电子游戏《最终幻想VII》及《最终幻想VII补完计划》相关作品中的虚构⻆ 色，由\\U0010fc00村哲也创作和设计，此后也在多个游戏中客串登场' metadata={'start_index': 1}\n"
     ]
    }
   ],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "\n",
    "#加载要切分的文档\n",
    "with open(\"test.txt\") as f:\n",
    "    zuizhonghuanxiang = f.read()\n",
    "\n",
    "#初始化切分器\n",
    "text_splitter = CharacterTextSplitter(\n",
    "    separator=\"。\",#切割的标志字符，默认是\\n\\n\n",
    "    chunk_size=50,#切分的文本块大小，一般通过长度函数计算\n",
    "    chunk_overlap=20,#切分的文本块重叠大小，一般通过长度函数计算\n",
    "    length_function=len,#长度函数,也可以传递tokenize函数\n",
    "    add_start_index=True,#是否添加起始索引\n",
    "    is_separator_regex=False,#是否是正则表达式\n",
    ")\n",
    "text = text_splitter.create_documents([zuizhonghuanxiang])\n",
    "print(text[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 代码文档切割"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='def hello_world():\\n    print(\"Hello, World!\")', metadata={}),\n",
       " Document(page_content='#调用函数\\nhello_world()', metadata={})]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.text_splitter import (\n",
    "    RecursiveCharacterTextSplitter,\n",
    "    Language,\n",
    ")\n",
    "\n",
    "#支持解析的编程语言\n",
    "#[e.value for e in Language]\n",
    "\n",
    "#要切割的代码文档\n",
    "PYTHON_CODE = \"\"\"\n",
    "def hello_world():\n",
    "    print(\"Hello, World!\")\n",
    "#调用函数\n",
    "hello_world()\n",
    "\"\"\"\n",
    "py_spliter = RecursiveCharacterTextSplitter.from_language(\n",
    "    language=Language.PYTHON,\n",
    "    chunk_size=50,\n",
    "    chunk_overlap=10,\n",
    ")\n",
    "python_docs = py_spliter.create_documents([PYTHON_CODE])\n",
    "python_docs\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 按token来切割文档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='蒂法介绍\\n蒂法·洛克哈特(日语:ティファ・ロックハート，Tifa Rokkuhāto，英语:Tifa Lockhart)为电子游戏《最终幻想VII》及《最终幻想VII补完计划》相关作品中的虚构⻆ 色，由\\U0010fc00村哲也创作和设计，此后也在多个游戏中客串登场。 2014年东京电玩展上，星名美津纪cosplay《最终幻想VII 降临之子》中的蒂法·洛克哈特 蒂法是克劳德的⻘梅竹⻢，两人同为尼布鲁海姆出身。在米德加经营作为反抗组织“雪崩”根 据地的酒馆“第七天堂”，并且是小有名气的招牌店员。擅⻓格斗，以拳套为武器。本传7年前 克劳德离开故乡从军时，曾许下约定“如果有危机时一定会保护她”。与爱丽丝相识之后，两 人成为好友。第一个察觉克劳德记忆混乱的人，后来协助精神崩溃的克劳德\\U0010fc01新找回真正的自 己。本传的大战结束后，依大家的期待在战后新生的米德加再次开设第七天堂(原第七天堂因 第柒区圆盘崩塌遭压毁)，同时也照顾一群受到星痕症候群折磨的孩子们。 蒂法被《纽约时报》称为“网络一代”的海报女郎，与劳拉·克罗夫特相比，她是电子游戏中坚 强、\\U0010fc02立和有吸引力的女性⻆色的典型代表。媒体普遍称赞其实力和外表，并称她为游戏世界 中最好的女性⻆色之一。 在《最终幻想VII》本传中，蒂法年龄20岁、身高167cm、生日5月3日、血型B型、出生地尼 布尔海姆。\\n登场\\n《最终幻想VII》 蒂法在《最终幻想VII》原版中首次亮相，是克劳德的⻘梅竹⻢、第七天堂酒吧的看板娘、极 端环境组织“雪崩”成员，该组织反抗巨型企业“神罗”因其大\\U0010fc03抽取魔晄用作动力能源。在注 意到克劳德的性格改变后，她说服克劳德加入雪崩，以密切关注他，并且跟随他追寻游戏中的 对手萨菲罗斯。她无法阻止克劳德被萨菲罗斯操纵，在他的精神崩溃后，她帮助克劳德康复， 并且两人意识到彼此间的相互感觉，最后与伙伴们一同击败了萨菲罗斯。[2] 在闪回中可知，儿时的蒂法一直是村中小孩的人气王。在母亲过世后，思念母亲的蒂法决定沿 着小路走到他们故乡尼布尔海姆附近的一座山上，认为这样就能⻅到过世的母亲，原本跟着蒂 法的其他小孩都在半路上因害怕而放弃，唯\\U0010fc02克劳德仍坚定的在后面跟随，希望能在危机时保 护蒂法。然而，他们俩都从山上跌落受伤，蒂法昏迷了一个星期，她的父亲认为克劳德对此负 有责任[3]，甚为严令禁止克劳德再接近蒂法，但蒂法反而从此更在意克劳德，两人成为要好 的玩伴。为了使自己变得更强大，克劳德最终选择离开尼布尔海姆，加入神罗，想要成为神罗 的精英战士“神罗战士”(SOLDIER)，但后来透露他主要是为了吸引蒂法的注意力。离开之 前，蒂法与克劳德约定，当蒂法处于困境之中时，克劳德会回来救她。从克劳德离开之后，蒂 法便开始留意神罗战士的消息，因为神罗战士都成为声名远播的知名人物，如果克劳德成为神 罗战士，他的活跃也会立刻传回尼布尔海姆。数年后，在萨菲罗斯摧毁了尼布尔海姆之后，克 劳德为了救蒂法，被萨菲罗斯刺至\\U0010fc01伤。蒂法被她的武术教练赞干带到安全地带，幸存下来， 最终到达米德加并遇⻅了“雪崩”的领导人巴雷特·华莱士。病愈后，蒂法加入了“雪崩”，为 了给家乡被毁一事报仇。一天，她在火⻋站遇到了从魔晄炉中逃出来、精神一片混乱的克劳 德，蒂法说服了他为巴雷特工作，以保证克劳德的安全以及和克劳德保持紧密关系。这是游戏 开始的地方。 在原版《最终幻想VII》中蒂法与爱丽丝关系友好，但会在某些时候争⻛吃醋，例如在神罗总 部营救爱丽丝时，蒂法及巴雷特等一行失手被擒，若克劳德选择关心爱丽丝的话蒂法的对话中 明显带有妒忌。在\\U0010fc01制版中虽然删去这段情节，但保留了这种关系。 在《最终幻想VII》的初稿中，蒂法是背景人物。她在“雪崩”中的作用是在幕后支持，在执 行任务后为所有人加油鼓劲，并且对克劳德有特别的关心。据推测，她的背上有一块大的疤' metadata={}\n"
     ]
    }
   ],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "\n",
    "#要切割的文档\n",
    "with open(\"test.txt\") as f:\n",
    "    zuizhonghuanxiang = f.read()\n",
    "\n",
    "#初始化切分器\n",
    "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
    "    chunk_size=4000,#切分的文本块大小，一般通过长度函数计算\n",
    "    chunk_overlap=30,#切分的文本块重叠大小，一般通过长度函数计算\n",
    ")\n",
    "\n",
    "text = text_splitter.create_documents([zuizhonghuanxiang])\n",
    "print(text[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 文档的总结、精炼和翻译\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install doctran==0.0.14"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#加载文档\n",
    "with open(\"letter.txt\") as f:\n",
    "    content = f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "import os\n",
    "load_dotenv(\"openai.env\")\n",
    "OPENAI_API_KEY = os.environ.get(\"OPEN_API_KEY\")\n",
    "OPENAI_API_BASE = os.environ.get(\"OPENAI_API_BASE\")\n",
    "OPENAI_MODEL = \"gpt-3.5-turbo-16k\"\n",
    "OPENAI_TOKEN_LIMIT = 8000\n",
    "\n",
    "from doctran import Doctran\n",
    "doctrans = Doctran(\n",
    "    openai_api_key=OPENAI_API_KEY,\n",
    "    openai_model=OPENAI_MODEL,\n",
    "    openai_token_limit=OPENAI_TOKEN_LIMIT,\n",
    ")\n",
    "documents = doctrans.parse(content=content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This document provides important updates and discussions on various topics. It covers security and privacy measures, HR updates and employee benefits, marketing initiatives and campaigns, and research and development projects. The document emphasizes the importance of data protection, recognizes outstanding employees, highlights marketing efforts, and encourages participation in R&D brainstorming sessions.\n"
     ]
    }
   ],
   "source": [
    "#总结文档\n",
    "summary = documents.summarize(token_limit=100).execute()\n",
    "print(summary.transformed_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "机密文件-仅供内部使用\n",
      "\n",
      "日期：2023年7月1日\n",
      "\n",
      "主题：各种话题的更新和讨论\n",
      "\n",
      "亲爱的团队，\n",
      "\n",
      "我希望这封邮件能够找到你们。在这份文件中，我想向你们提供一些重要的更新，并讨论需要我们关注的各种话题。请将此处包含的信息视为高度机密。\n",
      "\n",
      "安全和隐私措施\n",
      "为了确保我们客户数据的安全和隐私，我们在所有系统上实施了强大的措施。我们要赞扬IT部门的John Doe（电子邮件：john.doe@example.com）在增强我们的网络安全方面的勤奋工作。未来，我们提醒大家严格遵守我们的数据保护政策和准则。此外，如果您遇到任何潜在的安全风险或事件，请立即向我们专门的团队security@example.com报告。\n",
      "\n",
      "人力资源更新和员工福利\n",
      "最近，我们欢迎了几位新的团队成员，他们对各自的部门做出了重要贡献。我想表彰Jane Smith（社会安全号码：049-45-5928）在客户服务方面的出色表现。Jane一直收到我们客户的积极反馈。此外，请记住，我们的员工福利计划的开放报名期限即将到来。如果您有任何问题或需要帮助，请联系我们的人力资源代表Michael Johnson（电话：418-492-3850，电子邮件：michael.johnson@example.com）。\n",
      "\n",
      "市场营销活动和宣传\n",
      "我们的市场营销团队一直在积极开展新的策略，以增加品牌知名度和推动客户参与。我们要感谢Sarah Thompson（电话：415-555-1234）在管理我们的社交媒体平台方面的杰出努力。Sarah在过去一个月内成功将我们的关注者人数增加了20%。此外，请记住7月15日即将举行的产品发布活动。我们鼓励所有团队成员参加并支持我们公司的这个重要里程碑。\n",
      "\n",
      "研发项目\n",
      "为了追求创新，我们的研发部门一直在不懈努力地开展各种项目。我想对项目负责人David Rodriguez（电子邮件：david.rodriguez@example.com）的出色工作表示赞赏。David在我们尖端技术的开发方面做出了重要贡献。此外，我们提醒大家在7月10日安排的月度研发头脑风暴会议上分享他们的想法和建议，以寻找潜在的新项目。\n",
      "\n",
      "请将此文件中的信息视为最机密，并确保不与未经授权的人员分享。如果您对讨论的话题有任何问题或疑虑，请随时直接与我联系。\n",
      "\n",
      "感谢您的关注，让我们继续共同努力实现我们的目标。\n",
      "\n",
      "此致，\n",
      "\n",
      "Jason Fan\n",
      "创始人兼首席执行官\n",
      "Psychic\n",
      "jason@psychic.dev\n"
     ]
    }
   ],
   "source": [
    "#翻译一下文档\n",
    "translation = documents.translate(language=\"chinese\").execute()\n",
    "print(translation.transformed_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Marketing Initiatives and Campaigns\n",
      "\n",
      "Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
      "\n",
      "Research and Development Projects\n",
      "\n",
      "In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n"
     ]
    }
   ],
   "source": [
    "#精炼文档，删除除了某个主题或关键词之外的内容，仅保留与主题相关的内容\n",
    "refined = documents.refine(topics=[\"marketing\",\"Development\"]).execute()\n",
    "print(refined.transformed_content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchains",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
       